
Conversation

@Yanlilyu (Contributor) commented Jul 21, 2023

Description:
This branch adds a log-based routine to help users find the most I/O-intensive jobs. The file is job_stats.py in the PyDarshan CLI tools.
It combines the aggregated data from multiple log files into a single DataFrame, sorts the data in descending order by the column name the user supplies, and then filters the data to the first n (number_of_rows from user input) records. It returns a DataFrame containing the n most I/O-intensive jobs.
User input includes log_path, module, order_by_colname, and number_of_rows.
log_path should be a path glob pattern.
order_by_colname should be "agg_perf_by_slowest", "agg_time_by_slowest", or "total_bytes".
Example usage: $ python -m darshan job_stats darshan_logs/nonmpi_workflow/"worker*.darshan" STDIO total_bytes 5
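The combine/sort/filter pipeline described above can be sketched roughly as follows (the function and argument names here are illustrative, not the actual job_stats.py API):

```python
import pandas as pd

def top_n_jobs(per_log_dfs, order_by_colname, number_of_rows):
    # Combine the per-log statistics into one DataFrame, sort it in
    # descending order by the requested column, and keep the first n rows.
    combined = pd.concat(per_log_dfs, ignore_index=True)
    combined = combined.sort_values(by=order_by_colname, ascending=False)
    return combined.head(number_of_rows)
```

For example, three single-row DataFrames with total_bytes of 10, 30, and 20 would yield the 30- and 20-byte rows for n=2.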

@tylerjereddy (Collaborator) left a comment

I tried playing with it a bit this morning and thought I'd make a few suggestions:

I tried this command: python -m darshan job_stats ~/github_projects/darshan-logs/darshan_logs/mpi_io_test_with_dxt/treddy_mpi-io-test_id4373053_6-2-60198-9815401321915095332_1.darshan STDIO total_bytes 5

And got:

Statistical data of jobs:    agg_perf_by_slowest  agg_time_by_slowest                                  category_counters  ...  unique_io_total_time_by_slowest  unique_md_only_time_by_slowest  unique_rw_only_time_by_slowest
0             16.47327             0.000094  [<cdata 'struct darshan_file_category_counters...  ...                         0.000094                             0.0                        0.000094

The columns look a bit misaligned with the data. That's an easy enough fix, I think:

--- a/darshan-util/pydarshan/darshan/cli/job_stats.py
+++ b/darshan-util/pydarshan/darshan/cli/job_stats.py
@@ -162,7 +162,7 @@ def main(args: Union[Any, None] = None):
         com_dfs = combined_dfs(list_dfs)
         combined_dfs_sort = sort_data_desc(com_dfs, order_by_colname)
         combined_dfs_selected = first_n_recs(combined_dfs_sort, n)
-        print("Statistical data of jobs:", combined_dfs_selected)
+        print("Statistical data of jobs:\n", combined_dfs_selected)
Statistical data of jobs:
    agg_perf_by_slowest  agg_time_by_slowest                                  category_counters  ...  unique_io_total_time_by_slowest  unique_md_only_time_by_slowest  unique_rw_only_time_by_slowest
0             16.47327             0.000094  [<cdata 'struct darshan_file_category_counters...  ...                         0.000094                             0.0                        0.000094

That looks better!

I also tried running on all of the logs in the logs repo, just to cause trouble:

python -m darshan job_stats ~/github_projects/darshan-logs/darshan_logs/**/*.darshan STDIO total_bytes 5

And ran into an error:

usage: darshan <command> job_stats [-h] log_path module order_by_colname number_of_rows
darshan <command> job_stats: error: argument number_of_rows: invalid int value: '/Users/treddy/github_projects/darshan-logs/darshan_logs/e3sm_io_heatmaps_and_dxt/e3sm_io_heatmap_and_dxt.darshan'

Curiously, if I run on that file directly it seems to work ok, so it may be related to the ** glob pattern vs. command-line handling?
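For what it's worth, when the pattern is quoted on the command line it reaches Python verbatim, and the script can expand it itself; a minimal sketch (the pattern string is just an example):

```python
import glob

# A quoted pattern survives the shell untouched, so the script can expand
# it itself; recursive=True makes "**" match nested subdirectories.
pattern = "darshan_logs/**/*.darshan"
log_paths = sorted(glob.glob(pattern, recursive=True))
```

An unquoted pattern, by contrast, is expanded by the shell before Python ever sees it, which is the behavior observed above.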

Although I don't know Shane/your full vision for the use cases just yet, here's a suggestion to get the regression testing started:

--- /dev/null
+++ b/darshan-util/pydarshan/darshan/tests/test_job_stats.py
@@ -0,0 +1,32 @@
+import argparse
+from unittest import mock
+
+from darshan.log_utils import get_log_path
+from darshan.cli import job_stats
+import pytest
+
+
+@pytest.mark.parametrize(
+    "argv", [
+        [get_log_path("e3sm_io_heatmap_only.darshan"),
+         "STDIO",
+         "total_bytes",
+         "5"],
+    ]
+)
+def test_job_stats(argv, capsys):
+    with mock.patch("sys.argv", argv):
+        # initialize the parser
+        parser = argparse.ArgumentParser(description="")
+        # run through setup_parser()
+        job_stats.setup_parser(parser=parser)
+        # parse the input arguments
+        args = parser.parse_args(argv)
+    job_stats.main(args=args)
+    captured = capsys.readouterr()
+    assert "3.258853" in captured.out
+    # NOTE: probably much better to use i.e.,
+    # pd.read_csv or similar on the string data
+    # and reconstitute a DataFrame where you could
+    # use i.e., assert_frame_equal for floating point
+    # comparisons, etc.

Also, if printing to the terminal is the "final medium" for output, we could look at libraries like https://github.com/Textualize/rich or pandas-related things built on top of that for some nice formatting I suppose.

@shanedsnyder

I agree with Tyler's suggestions above, so if we could incorporate those I think that'd be great.

also tried running on all of the logs in the logs repo, just to cause trouble:

python -m darshan job_stats ~/github_projects/darshan-logs/darshan_logs/**/*.darshan STDIO total_bytes 5

And ran into an error:

Same here. I think this is being caused by shell wildcard expansion -- your shell automatically expands the pattern above into all matching Darshan log files. The script then sees a ton of log files followed by the remaining command-line arguments, which causes the rest of the argument parsing to blow up, obviously. If you enclose the pattern in quotes (and use your actual home directory path rather than ~), it works fine, as no expansion occurs.

Thinking about it more, maybe we should just lean entirely on shell wildcard expansion and not worry about anything glob-related in the Python script. I think users will be tripped up by this too, trying to figure out the right way to express a pattern that gets through the shell all the way to Python. The script would then just accept a list of log files on the command line, with the user responsible for creating that list (via a shell glob or by specifying them manually). Any objections to that? (Sorry @Yanlilyu, I know this is what you originally started with, but this may have been a bad suggestion on my part...)

If we go that direction, I'd suggest switching the remainder of the command line arguments from positional arguments to named arguments (e.g., -n rows, -m module, etc.). Further, we should probably just give them default values so users don't have to always specify them. We could default to total_bytes, POSIX module, and all rows, for instance?
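The suggested interface could look roughly like this (flag names and defaults are illustrative, not a final design):

```python
import argparse

def setup_parser(parser):
    # Accept one or more log files (e.g. from a shell glob expansion),
    # plus optional named arguments with sensible defaults.
    parser.add_argument("log_paths", nargs="+",
                        help="Darshan log file(s) to scan")
    parser.add_argument("-m", "--module", default="POSIX",
                        help="module to report on (default: POSIX)")
    parser.add_argument("-o", "--order_by", default="total_bytes",
                        help="column to sort by (default: total_bytes)")
    parser.add_argument("-n", "--rows", type=int, default=None,
                        help="number of rows to show (default: all)")
```

With this shape, `python -m darshan job_stats *.darshan` alone would give a full report sorted by total_bytes.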

Also, if printing to the terminal is the "final medium" for output, we could look at libraries like https://github.com/Textualize/rich or pandas-related things built on top of that for some nice formatting I suppose.

I agree, this would be nice, but we can always explore this later. I think the ultimate output format for this tool remains just shell output (it's really just a quick way to scan a bunch of log files for interesting jobs, it doesn't really need a super formal output format), so maybe that would be something like Rich rather than Pandas styling?

@shanedsnyder

Oh, and BTW, after figuring out how to properly format the glob to account for all logs in the Darshan logs repo, I do run into an error that should be properly guarded against:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/tmp/darshan/darshan-util/pydarshan/darshan/__main__.py", line 3, in <module>
    main()
  File "/tmp/darshan/darshan-util/pydarshan/darshan/cli/__init__.py", line 164, in main
    mod.main(args)
  File "/tmp/darshan/darshan-util/pydarshan/darshan/cli/job_stats.py", line 162, in main
    df_i = df_IO_data(log_paths[i], mod)
  File "/tmp/darshan/darshan-util/pydarshan/darshan/cli/job_stats.py", line 24, in df_IO_data
    posix_recs = report.records[mod].to_df()
KeyError: 'STDIO'

Not every log is guaranteed to have a given module included in the log file. If a module isn't found in a log file, we should assign it "N/A" values or something like that and put it at the end of the list. Does that seem reasonable?
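A minimal sketch of that guard, assuming a hypothetical per-log mapping of module name to a stats dict (stats_or_na and STAT_COLS are made-up names, not the actual job_stats.py code):

```python
import numpy as np
import pandas as pd

STAT_COLS = ["total_bytes", "agg_perf_by_slowest", "agg_time_by_slowest"]

def stats_or_na(module_stats, mod):
    # If the requested module is absent from this log, return NaN
    # placeholders instead of raising a KeyError.
    if mod not in module_stats:
        return pd.DataFrame([{c: np.nan for c in STAT_COLS}])
    return pd.DataFrame([module_stats[mod]])
```

Sorting the combined frame with `sort_values(..., na_position="last")` would then keep those placeholder rows at the end of the list, as suggested.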

@shanedsnyder

One final comment after looking this over: Can we drop all other columns from the derived statistics dataframe other than the ones we are concerned with (total_bytes, agg_perf_by_slowest, agg_time_by_slowest)? I don't think they are useful to include and will help make the output more digestible.
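The trimming itself is a one-liner with pandas column selection (trim_columns is a hypothetical helper name):

```python
import pandas as pd

KEEP = ["total_bytes", "agg_perf_by_slowest", "agg_time_by_slowest"]

def trim_columns(df):
    # Drop everything except the statistics columns of interest.
    return df[KEEP]
```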

@Yanlilyu Yanlilyu closed this Aug 11, 2023
@shanedsnyder

We moved this all into one PR: #954
